URL/text analysis engine for clean content extraction & scoring.
Track 02 · Data Prep. The text preprocessing utility built for the Research Agent. Content-Analyzer strips layout noise (navigation, sidebars, footers, scripts) from raw HTML pages, truncates the text to fit a token budget, sends it to GPT-4o-mini, and parses the response into structured JSON scores. Extracted from the production Agentic OS.
Feeding raw HTML web pages directly into an LLM context is expensive and error-prone. Standard web pages are packed with boilerplate markup: navigation menus, site footers, sharing buttons, sidebar ads, tracking scripts, and styling rules. This noise can represent up to 80% of the raw characters, diluting the actual information and inflating your API costs.
To score sources fast and accurately, you need an extraction engine that strips the markup hierarchy, isolates the core text body, and runs a fast structured evaluation to filter out irrelevant or low-quality articles before they reach your synthesis pipeline.
The extractor targets the <article> and <main> elements to extract raw content.
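The stripping and truncation steps described above can be sketched as follows. This is a minimal illustration, not the project's code: it uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without dependencies, and the names `NOISE_TAGS`, `MainTextExtractor`, `extract_main_text`, and the `max_chars` default are assumptions made for the example.

```python
from html.parser import HTMLParser

# Assumed noise list for this sketch; the real filter set may differ.
NOISE_TAGS = {"nav", "aside", "footer", "script", "style", "noscript"}
# Containers treated as the core text body.
CONTENT_TAGS = {"article", "main"}


class MainTextExtractor(HTMLParser):
    """Collects text inside <article>/<main> and skips noise subtrees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.noise_depth = 0       # > 0 while inside a noise tag
        self.content_depth = 0     # > 0 while inside <article> or <main>
        self.content_chunks = []   # text found inside a content container
        self.fallback_chunks = []  # all non-noise text, used if no container exists

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in CONTENT_TAGS:
            self.content_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1
        elif tag in CONTENT_TAGS and self.content_depth > 0:
            self.content_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text or self.noise_depth > 0:
            return
        self.fallback_chunks.append(text)
        if self.content_depth > 0:
            self.content_chunks.append(text)


def extract_main_text(html, max_chars=6000):
    """Return cleaned body text, cut to max_chars as a rough token budget."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    chunks = parser.content_chunks or parser.fallback_chunks
    return " ".join(chunks)[:max_chars]


page = (
    "<html><body><nav>Home About</nav>"
    "<main><h1>Title</h1><p>Body text.</p><script>var x=1;</script></main>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(page))
```

A character slice is only a proxy for a token limit. It keeps the sketch simple, but a tokenizer-based cut would be more precise.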
{
  "summary": "A single-sentence summary of the page content.",
  "key_insights": [
    "Insight bullet point 1",
    "Insight bullet point 2",
    "Insight bullet point 3"
  ],
  "sentiment": "positive | negative | neutral",
  "relevance_score": 8, // scale of 0-10
  "quality_score": 7, // scale of 0-10
  "content_type": "news | analysis | tutorial | report | marketing"
}
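Because the model's reply is free text until it is parsed, it helps to validate it against the schema above before trusting the scores. The sketch below shows one way to do that with the standard `json` module. The function name `parse_analysis`, the choice to clamp out-of-range scores instead of rejecting them, and the choice to return `None` on failure are all assumptions for illustration, not the behavior of `src/analyzer.py`.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
ALLOWED_TYPES = {"news", "analysis", "tutorial", "report", "marketing"}


def _clamp_score(value):
    """Return an int in 0..10, or None if value is not numeric."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return max(0, min(10, int(round(value))))


def parse_analysis(raw):
    """Parse a model reply into a validated dict, or None if unusable."""
    # Tolerate prose or fences around the object by slicing to the outer braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    summary = data.get("summary")
    insights = data.get("key_insights")
    sentiment = str(data.get("sentiment", "")).lower()
    content_type = str(data.get("content_type", "")).lower()
    relevance = _clamp_score(data.get("relevance_score"))
    quality = _clamp_score(data.get("quality_score"))

    if not isinstance(summary, str) or not isinstance(insights, list):
        return None
    if sentiment not in ALLOWED_SENTIMENTS or content_type not in ALLOWED_TYPES:
        return None
    if relevance is None or quality is None:
        return None

    return {
        "summary": summary,
        "key_insights": [str(item) for item in insights],
        "sentiment": sentiment,
        "relevance_score": relevance,
        "quality_score": quality,
        "content_type": content_type,
    }
```

Returning `None` lets the caller decide whether to retry the request or drop the source. Note that the `//` comments in the template above are annotations for the reader; strict JSON parsers reject them.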
src/extractor.py: Contains URL loading routines, User-Agent rotation parameters, BeautifulSoup filters for stripping markup noise, and character slice boundaries.

src/analyzer.py: Handles client calls to the OpenAI SDK, wraps inputs in a JSON schema template, and catches decoding errors.

demo/run.py: A test command line script showing how to scrape a single webpage and print the resulting metadata block.

git clone https://github.com/shubham0086/content-analyzer
cd content-analyzer
pip install -r requirements.txt
cp .env.example .env
# Run simple test parser
python demo/run.py
Content-Analyzer runs as a **data ingestion utility**. It acts as the core scoring node inside the Research Agent. By analyzing each webpage candidate beforehand, the Research Agent can filter out low-scoring or promotional links, only passing high-quality content summaries to the final synthesis models.
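The filtering step can be sketched as a small function over the structured scores. The thresholds, the decision to block the `marketing` content type, and the name `select_sources` are illustrative assumptions; the Research Agent's actual cut-offs are not stated here.

```python
def select_sources(analyses, min_relevance=6, min_quality=6,
                   blocked_types=("marketing",)):
    """Keep analyses that clear both score thresholds, best first."""
    kept = [
        a for a in analyses
        if a["relevance_score"] >= min_relevance
        and a["quality_score"] >= min_quality
        and a["content_type"] not in blocked_types
    ]
    # Rank by combined score so the strongest sources reach synthesis first.
    return sorted(
        kept,
        key=lambda a: a["relevance_score"] + a["quality_score"],
        reverse=True,
    )


candidates = [
    {"summary": "Deep dive", "relevance_score": 9, "quality_score": 8,
     "content_type": "analysis"},
    {"summary": "Product pitch", "relevance_score": 8, "quality_score": 7,
     "content_type": "marketing"},
    {"summary": "Thin post", "relevance_score": 4, "quality_score": 6,
     "content_type": "news"},
]
for item in select_sources(candidates):
    print(item["summary"])
```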
This parser relies on CSS class identifiers to strip boilerplate. Sites that render content entirely client-side via complex JavaScript (Single Page Applications without server-side pre-rendering) will return blank pages to simple requests. To scrape SPAs reliably, you would need to hook up a headless browser (like Playwright), which adds significant memory and runtime overhead. For standard blog posts and news publications, this static scraper retrieves contents in under 200ms.
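One inexpensive way to notice the SPA case is to check whether a page yields almost no text while still shipping script tags, and only then pay for a headless browser. The heuristic below is an assumption for illustration and not part of the project. The name `looks_client_rendered` and the 200-character threshold are hypothetical.

```python
def looks_client_rendered(html, extracted_text, min_chars=200):
    """Guess whether a page needs a headless browser to render its content."""
    has_scripts = "<script" in html.lower()
    too_little_text = len(extracted_text.strip()) < min_chars
    return has_scripts and too_little_text


spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
print(looks_client_rendered(spa_shell, ""))
```

A page flagged this way could be queued for a slower rendering path, while everything else stays on the static scraper.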